medical specialty


CoT-X: An Adaptive Framework for Cross-Model Chain-of-Thought Transfer and Optimization

Bi, Ziqian, Chen, Kaijie, Wang, Tianyang, Hao, Junfeng, Peng, Benji, Song, Xinyuan

arXiv.org Artificial Intelligence

Chain-of-Thought (CoT) reasoning enhances the problem-solving ability of large language models (LLMs) but leads to substantial inference overhead, limiting deployment in resource-constrained settings. This paper investigates efficient CoT transfer across models of different scales and architectures through an adaptive reasoning summarization framework. The proposed method compresses reasoning traces via semantic segmentation with importance scoring, budget-aware dynamic compression, and coherence reconstruction, preserving critical reasoning steps while significantly reducing token usage. Experiments on 7,501 medical examination questions across 10 specialties show up to 40% higher accuracy than truncation under the same token budgets. Evaluations on 64 model pairs from eight LLMs (1.5B-32B parameters, including DeepSeek-R1 and Qwen3) confirm strong cross-model transferability. Furthermore, a Gaussian Process-based Bayesian optimization module reduces evaluation cost by 84% and reveals a power-law relationship between model size and cross-domain robustness. These results demonstrate that reasoning summarization provides a practical path toward efficient CoT transfer, enabling advanced reasoning under tight computational constraints. Code will be released upon publication.
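The compression pipeline the abstract describes (segment the trace, score each segment's importance, keep the best steps under a token budget, then restore order for coherence) can be sketched roughly as follows. The word-overlap importance score and whitespace token counting here are stand-in assumptions for illustration; the paper's actual scoring and segmentation are not specified in this abstract.

```python
# Illustrative sketch of budget-aware reasoning-trace compression:
# segment -> score importance -> greedily select under budget -> reorder.

def compress_trace(question: str, trace: str, token_budget: int) -> str:
    q_words = set(question.lower().split())
    # Segment the trace into sentence-like reasoning steps.
    steps = [s.strip() for s in trace.split(".") if s.strip()]
    # Score each step by word overlap with the question
    # (a crude proxy for the paper's importance scoring).
    scored = sorted(
        enumerate(steps),
        key=lambda it: len(q_words & set(it[1].lower().split())),
        reverse=True,
    )
    kept, used = [], 0
    for idx, step in scored:
        cost = len(step.split())  # whitespace tokens as a rough count
        if used + cost <= token_budget:
            kept.append((idx, step))
            used += cost
    # Restore original step order to keep the compressed trace coherent.
    return ". ".join(step for _, step in sorted(kept)) + "."
```

Greedy selection under a budget is one simple realization of "budget-aware dynamic compression"; the paper's module may use a learned or adaptive policy instead.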


Dr. GPT Will See You Now, but Should It? Exploring the Benefits and Harms of Large Language Models in Medical Diagnosis using Crowdsourced Clinical Cases

Mingole, Bonam, Majumdar, Aditya, Choudhury, Firdaus Ahmed, Kraschnewski, Jennifer L., Sundar, Shyam S., Yadav, Amulya

arXiv.org Artificial Intelligence

The proliferation of Large Language Models (LLMs) in high-stakes applications such as medical (self-)diagnosis and preliminary triage raises significant ethical and practical concerns about the effectiveness, appropriateness, and possible harmfulness of the use of these technologies for health-related concerns and queries. Some prior work has considered the effectiveness of LLMs in answering expert-written health queries/prompts, questions from medical examination banks, or queries based on pre-existing clinical cases. Unfortunately, these existing studies completely ignore an in-the-wild evaluation of the effectiveness of LLMs in answering everyday health concerns and queries typically asked by general users, which corresponds to the more prevalent use case for LLMs. To address this research gap, this paper presents the findings from a university-level competition that leveraged a novel, crowdsourced approach for evaluating the effectiveness of LLMs in answering everyday health queries. Over the course of a week, a total of 34 participants prompted four publicly accessible LLMs with 212 real (or imagined) health concerns, and the LLM-generated responses were evaluated by a team of nine board-certified physicians. At a high level, our findings indicate that on average, 76% of the 212 LLM responses were deemed to be accurate by physicians. Further, with the help of medical professionals, we investigated whether RAG versions of these LLMs (powered with a comprehensive medical knowledge base) can improve the quality of responses generated by LLMs. Finally, we also derive qualitative insights to explain our quantitative findings by conducting interviews with seven medical professionals who were shown all the prompts in our competition. This paper aims to provide a more grounded understanding of how LLMs perform in real-world everyday health communication.


Human-AI collectives produce the most accurate differential diagnoses

Zöller, N., Berger, J., Lin, I., Fu, N., Komarneni, J., Barabucci, G., Laskowski, K., Shia, V., Harack, B., Chu, E. A., Trianni, V., Kurvers, R. H. J. M., Herzog, S. M.

arXiv.org Artificial Intelligence

Artificial intelligence systems, particularly large language models (LLMs), are increasingly being employed in high-stakes decisions that impact both individuals and society at large, often without adequate safeguards to ensure safety, quality, and equity. Yet LLMs hallucinate [1-4], lack common sense [5], and are biased [6, 7]--shortcomings that may reflect LLMs' inherent limitations and thus may not be remedied by more sophisticated architectures, more data, or more human feedback. Relying solely on LLMs for complex, high-stakes decisions is therefore problematic. Here we present a hybrid collective intelligence system that mitigates these risks by leveraging the complementary strengths of human experience and the vast information processed by LLMs. We show that hybrid collectives of physicians and LLMs outperform both single physicians and physician collectives, as well as single LLMs and LLM ensembles. This result holds across a range of medical specialties and professional experience, and can be attributed to humans' and LLMs' complementary contributions that lead to different kinds of errors. Our approach highlights the potential for collective human and machine intelligence to improve accuracy in complex, open-ended domains [8] like medical diagnostics. Diagnostic errors are among the most pressing issues in medical practice [9-11], causing an estimated 795,000 deaths and permanent disabilities in the United States alone each year [12]. Reducing diagnostic errors--without incurring substantially higher costs--is essential to improve patient outcomes worldwide. This challenge has motivated a recent surge in diagnostic technologies exploiting artificial intelligence (AI) to interpret medical records, tests, and images [13, 14]. Deep learning approaches in medical imaging have shown great promise. 
Notable examples include mammography interpretation, cardiac function assessment, and lung cancer screening, some of which have progressed beyond the testing phase and entered clinical practice [15-17]. Recent years have also witnessed the rise of AI foundation models, especially LLMs, which show remarkable abilities to process natural language, providing accurate answers to questions in almost any domain, including medicine [18-21]. However, a recent meta-analysis [22] found that physicians often outperform LLMs, and that LLMs differ vastly in performance, including between medical specialties.
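A hybrid collective that pools ranked differential diagnoses from several diagnosticians (human or LLM) can be sketched with a simple rank-fusion rule. The study's actual aggregation method is not given in this excerpt; Borda counting is used here purely as a common, minimal illustration of combining ranked lists.

```python
# Sketch: fuse ranked differential-diagnosis lists from multiple
# diagnosticians (humans and/or LLMs) with a Borda count.
from collections import defaultdict

def aggregate_differentials(rankings: list[list[str]], top_k: int = 3) -> list[str]:
    scores: dict[str, int] = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for pos, dx in enumerate(ranking):
            scores[dx] += n - pos  # higher-ranked diagnoses earn more points
    # Sort by score (descending), breaking ties alphabetically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [dx for dx, _ in ranked][:top_k]
```

Because humans and LLMs tend to make different kinds of errors, a diagnosis that several independent rankers place highly accumulates points even if no single ranker puts it first, which is the intuition behind such collective schemes.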


More doctors use ChatGPT to help with busy workloads, but is AI a reliable assistant?

FOX News

Dr. AI will see you now. It might not be that far from the truth, as more and more physicians are turning to artificial intelligence to ease their busy workloads. Studies have shown that up to 10% of doctors are now using ChatGPT, a large language model (LLM) made by OpenAI -- but just how accurate are its responses? A team of researchers from the University of Kansas Medical Center decided to find out.


Enhancing Medical Specialty Assignment to Patients using NLP Techniques

Solomou, Chris

arXiv.org Artificial Intelligence

The introduction of Large Language Models (LLMs), and the vast volume of publicly available medical data, amplified the application of NLP to the medical domain. However, LLMs are pretrained on data that are not explicitly relevant to the domains they are applied to and are often biased towards the original data they were pretrained upon. Even when pretrained on domain-specific data, these models typically require time-consuming fine-tuning to achieve good performance for a specific task. To address these limitations, we propose an alternative approach that achieves superior performance while being computationally efficient. Specifically, we utilize keywords to train a deep learning architecture that outperforms a language model pretrained on a large corpus of text. Our proposal does not require pretraining nor fine-tuning and can be applied directly to a specific setting for performing multi-label classification. Our objective is to automatically assign a new patient to the specialty of the medical professional they require, using a dataset that contains medical transcriptions and relevant keywords. To this end, we fine-tune the PubMedBERT model on this dataset, which serves as the baseline for our experiments. We then train/fine-tune a DNN and the RoBERTa language model twice each, using the keywords and the full transcriptions in turn as input. We compare the performance of these approaches using relevant metrics. Our results demonstrate that utilizing keywords for text classification significantly improves classification performance, for both a basic DL architecture and a large language model. Our approach represents a promising and efficient alternative to traditional methods for fine-tuning language models on domain-specific data and has potential applications in various medical domains.
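The core idea of assigning a patient to specialties from keywords, rather than full pretrained-model pipelines, can be sketched with a minimal multi-label matcher. The specialty keyword lists and the overlap threshold below are invented for illustration; the paper instead trains a DNN and RoBERTa on its dataset's keywords.

```python
# Minimal sketch of keyword-based multi-label specialty assignment.
# Keyword sets and threshold are illustrative assumptions, not the
# paper's learned model.

SPECIALTY_KEYWORDS = {
    "cardiology": {"chest", "pain", "ecg", "heart", "palpitations"},
    "dermatology": {"rash", "skin", "lesion", "itching"},
    "neurology": {"headache", "seizure", "numbness", "dizziness"},
}

def assign_specialties(transcription: str, threshold: int = 2) -> list[str]:
    """Return every specialty whose keyword overlap with the
    transcription meets the threshold (multi-label: a patient
    may match several specialties, or none)."""
    words = set(transcription.lower().split())
    return sorted(
        s for s, kw in SPECIALTY_KEYWORDS.items()
        if len(words & kw) >= threshold
    )
```

Even this toy matcher shows why keywords carry a strong signal: the label-relevant vocabulary is concentrated in a few terms, which is the property the paper's keyword-trained DNN exploits.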


Overview of Current Applications of Large Language Models in Various Medical Specialities

Mumtaz, Ummara, Ahmed, Awais, Mumtaz, Summaya

arXiv.org Artificial Intelligence

This paper gives an overview of the latest applications of Large Language Models (LLMs) in the healthcare sector, highlighting their transformative role in enhancing medical care quality. By processing vast amounts of data from diverse medical domains, LLMs have become pivotal in assisting doctors, healthcare providers, and patients. We explore their utilization in various medical specialties, including cancer diagnostics, dentistry, nephrology, and dermatology. The paper includes the LLM methodologies applied in various medical specialties, different data types in the medical domains and the relevant input formatting for LLMs, along with practical use-cases of LLMs in the healthcare domain.


Top 10 medical specialties using AI/machine learning-enabled devices

#artificialintelligence

The vast majority of FDA-approved medical devices enabled by artificial intelligence or machine learning are concentrated in radiology and cardiovascular care, according to an analysis by Rock Health. Rock Health used data from FDA clearances and approvals from 1997 to 2021 to determine where these devices are used the most. Here are the AI/machine learning-enabled devices by therapeutic area, according to the Oct. 8 report:


The rise of AI in medicine

#artificialintelligence

By now, it's almost old news that artificial intelligence (AI) will have a transformative role in medicine. Algorithms have the potential to work tirelessly, at faster rates and now with potentially greater accuracy than clinicians. In 2016, it was predicted that 'machine learning will displace much of the work of radiologists and anatomical pathologists'. In the same year, a University of Toronto professor controversially announced that 'we should stop training radiologists now'. But is it really the beginning of the end for some medical specialties?


Using Machine Learning to Assess Physician Competence: A... : Academic Medicine

#artificialintelligence

Purpose: To identify the different machine learning (ML) techniques that have been applied to automate physician competence assessment and evaluate how these techniques can be used to assess different competence domains in several medical specialties. Method: In May 2017, MEDLINE, EMBASE, PsycINFO, Web of Science, ACM Digital Library, IEEE Xplore Digital Library, PROSPERO, and Cochrane Database of Systematic Reviews were searched for articles published from inception to April 30, 2017. Studies were included if they applied at least one ML technique to assess medical students', residents', fellows', or attending physicians' competence. Information on sample size, participants, study setting and design, medical specialty, ML techniques, competence domains, outcomes, and methodological quality was extracted. MERSQI was used to evaluate quality, and a qualitative narrative synthesis of the medical specialties, ML techniques, and competence domains was conducted.